This project analyzes the physicochemical properties that affect the quality of 1599 variants of the Portuguese “Vinho Verde” red wine.
The physicochemical properties in the data set are based on objective tests.
The quality of each wine is graded from 0 (very bad) to 10 (very excellent), based on the median of at least 3 evaluations by wine experts.
My objective is to determine which of the physicochemical properties affect wine quality, and then build a linear model based on those factors to predict quality.
| Property | Unit | Description |
|---|---|---|
| Fixed acidity | gm/L | Most acids involved with wine are fixed or nonvolatile (do not evaporate readily). |
| Volatile acidity | gm/L | The amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste. |
| Citric acid | gm/L | Found in small quantities, citric acid can add ‘freshness’ and flavor to wines. |
| Residual sugar | gm/L | The amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet. |
| Chlorides | gm/L | The amount of salt in the wine. |
| Free sulfur dioxide | mg/L | The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. |
| Total sulfur dioxide | mg/L | Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. |
| Density | gm/mL | The density of wine is close to that of water depending on the percent alcohol and sugar content. |
| pH | - | Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale. |
| Sulphates | gm/L | A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. |
| Alcohol | % | The percent alcohol content of the wine. |
Wine Quality Data Set Information
Confirming that all 1599 rows were loaded.
## [1] 1599
Confirming that all columns were loaded.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Sample data shows that quality has discrete numeric values, and all physicochemical properties have continuous numeric values.
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
A summary and plot of quality shows that wines in the data set have grades between 3 and 8. None of the wines are close to being very bad or very excellent.
For the purpose of analysis, I have categorized grades 3 and 4 as Low, grades 5 and 6 as Medium, and grades 7 and 8 as High.
About 4% of the wines are of Low quality. 82.5% of the wines are of Medium quality. 13.5% of the wines are of High quality.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## Low Medium High
## 63 1319 217
##
## Low Medium High
## 0.03939962 0.82489056 0.13570982
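The grouping above can be reproduced from the printed grade counts alone. The following is a minimal sketch, assuming the bins are built with `cut()`; the report does not show the code it used, and the names `grade.counts` and `rating` are illustrative.

```r
# Rebuild the quality column from the grade counts printed above
grade.counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(grade.counts)), times = grade.counts)

# Bin the grades: (2,4] is Low, (4,6] is Medium, (6,8] is High
rating <- cut(quality, breaks = c(2, 4, 6, 8), labels = c("Low", "Medium", "High"))

table(rating)                         # counts per category
round(prop.table(table(rating)), 3)   # share of wines per category
```

Because `cut()` uses right-closed intervals by default, grade 4 falls in Low and grade 6 in Medium, which matches the categories described above.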
Some wine qualities are outliers (grades 3 and 8) for the provided data. However, they do belong in the data set for further analysis: each grade is the median of at least three evaluations, so these values are unlikely to be errors.
## [1] 3 8
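The report does not show how the outlying grades were identified. One plausible way, sketched here under that assumption, is the 1.5 x IQR box plot rule via `boxplot.stats()`, applied to a quality vector rebuilt from the grade counts printed earlier. The names `grade.counts` and `outlier.grades` are illustrative.

```r
# Rebuild the quality column from the grade counts in this report
grade.counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(grade.counts)), times = grade.counts)

# boxplot.stats() flags values more than 1.5 * IQR beyond the hinges (coef = 1.5 by default)
stats <- boxplot.stats(quality)
outlier.grades <- sort(unique(stats$out))
outlier.grades
```

With hinges at grades 5 and 6, the rule flags anything below 3.5 or above 7.5, which is why only grades 3 and 8 are reported.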
The values look normally distributed with few outliers, with most wines here having fixed acidity in the range of 5 to 11 gm/L.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Volatile acidity also seems to be normally distributed, most wines here having acetic acid in the range of 0.2 to 0.9 gm/L.
The mean and median are close, 0.53 and 0.52 respectively.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Over 8% of the wines here (132 of 1599) have no citric acid, the rest having less than 0.8 gm/L.
Mean and median are close, 0.27 and 0.26 respectively.
##
## FALSE TRUE
## 1467 132
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Most wines here have between 1 and 3 gm/L of residual sugar, with some outliers having up to 15 gm/L.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Most wines here have 0.04 to 0.12 gm/L of salt, with a few outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The distribution of values is long tailed, with most wines here having free sulfur dioxide in the range of 3 to 40 mg/L, and total sulfur dioxide in the range of 6 to 150 mg/L.
According to the data set description, free sulfur dioxide becomes evident in the nose and taste of wine only at concentrations over 50 ppm (mg/L). Just 1% (16 of 1599) of the wines here have free sulfur dioxide concentrations above 50 ppm.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
##
## FALSE TRUE
## 1583 16
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Wine density looks normally distributed, in a close range between 0.990 and 1.004 gm/mL.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
pH levels look normally distributed, mainly in the range of 3.0 to 3.6.
The mean and median are nearly identical at about 3.31.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Sulphates content is mostly between 0.45 to 0.9 gm/L. There are some outliers on the higher end of the value range.
The mean and median are close, 0.66 and 0.62 respectively.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Most wines here have an alcohol content between 9 and 13%.
Mean and median are a little over 10%.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The following charts show a box plot of each property against quality. The patterns in the charts indicate a linear relation between quality and some of the properties, such as alcohol, volatile acidity, sulphates, citric acid and density.
The data set description states that “citric acid can add ‘freshness’ and flavor to wines”. The box plot does indicate that, but I want to explore it further.
About 8% of the wines here (132 of 1599) have no citric acid. I want to compare the distribution of grades for wines with and without citric acid.
The charts below show that only 6% of wines with no citric acid (8 of 132) are of High quality (grade 7 or better).
Over 14% of wines with citric acid (209 of 1467) are of High quality.
The data analysis seems to confirm that the presence of citric acid does influence wine quality positively.
##
## 3 4 5 6 7 8
## 7 43 624 584 191 18
##
## 3 4 5 6 7
## 3 10 57 54 8
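The percentages quoted above follow directly from the two tables. A minimal sketch of that arithmetic, with the grade counts copied from the output above; the helper `high.share` is a hypothetical name and not part of the original analysis.

```r
# Grade counts copied from the two tables above
with.citric    <- c("3" = 7, "4" = 43, "5" = 624, "6" = 584, "7" = 191, "8" = 18)
without.citric <- c("3" = 3, "4" = 10, "5" = 57, "6" = 54, "7" = 8)

# Share of wines graded 7 or better (the High category used in this report)
high.share <- function(counts) {
  grades <- as.integer(names(counts))
  sum(counts[grades >= 7]) / sum(counts)
}

# Percentage of High quality wines in each group
round(100 * c(with = high.share(with.citric), without = high.share(without.citric)), 1)
```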
The data set description states that “at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine”. Only 1% of the wines here (16 of 1599) meet that criterion, but I want to explore if free SO2 could impact wine quality.
About 13.6% of the wines here with free SO2 concentrations below 50 ppm (215 of 1581) are of High quality (grade 7 or better).
On the other hand, only 12.5% of the wines here (2 of 16) with free SO2 concentrations over 50 ppm are of High quality.
Although this finding contradicts conventional wisdom, it is worth noting that no wine here with over 50 ppm free SO2 concentration is of Low quality (grade 4 or lower).
##
## 3 4 5 6 7 8
## 10 53 671 632 197 18
##
## 5 6 7
## 9 5 2
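With only 16 wines above the threshold, the gap between the two shares rests on very few observations. The following is a minimal sketch of one way to check that, assuming Fisher's exact test is an acceptable choice for a 2 x 2 table with a small cell count. This test is not part of the original analysis; the counts are taken from the tables above and the name `so2` is illustrative.

```r
# Rows: free SO2 group; columns: High (grade 7 or 8) versus not High
so2 <- matrix(c(215, 1581 - 215,
                  2,   16 - 2),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("below 50 ppm", "over 50 ppm"),
                              c("High", "Not high")))

# Percentage of High quality wines in each group
round(100 * so2[, "High"] / rowSums(so2), 1)

# Exact test of whether the two shares differ
p.value <- fisher.test(so2)$p.value
p.value
```

A large p-value here would mean the data cannot distinguish the two groups, so the comparison is better read as inconclusive than as evidence in either direction.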
The correlation coefficients confirm that volatile acidity, sulphates and alcohol influence wine quality to an extent that could be significant, and citric acid and density to a lesser degree.
## property r min max
## 1 fixed.acidity 0.12405165 4.6000 15.900
## 2 volatile.acidity -0.39055778 0.1200 1.580
## 3 citric.acid 0.22637251 0.0000 1.000
## 4 residual.sugar 0.01373164 0.9000 15.500
## 5 chlorides -0.12890656 0.0120 0.611
## 6 free.sulfur.dioxide -0.05065606 1.0000 72.000
## 7 total.sulfur.dioxide -0.18510029 6.0000 289.000
## 8 density -0.17491923 0.9901 1.004
## 9 pH -0.05773139 2.7400 4.010
## 10 sulphates 0.25139708 0.3300 2.000
## 11 alcohol 0.47616632 8.4000 14.900
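The code behind the table above is not shown. One way such a table could be produced is sketched below, assuming Pearson correlation via `cor()`. To keep the block self-contained it runs on the six sample rows printed earlier (three predictors only), so its coefficients will not match the table; on the full data frame the same helper would be called as `correlate.with(wqr)`. The helper name `correlate.with` is hypothetical.

```r
# First six rows of the data set, as printed earlier in this report
sample.rows <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5, 5, 5, 6, 5, 5)
)

# Pearson correlation of every property with the target, plus the value range
correlate.with <- function(df, target = "quality") {
  props <- setdiff(names(df), c("X", target))   # skip the row index and the target
  data.frame(
    property = props,
    r   = sapply(props, function(p) cor(df[[p]], df[[target]])),
    min = sapply(props, function(p) min(df[[p]])),
    max = sapply(props, function(p) max(df[[p]])),
    row.names = NULL
  )
}

result <- correlate.with(sample.rows)
result
```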
Citric acid, being non-volatile, has a positive linear relationship with fixed acidity.
From the scales it seems citric acid is a small part of overall fixed acidity.
Acetic acid, the volatile acid in wine, seems to be lower in wines when citric acid is higher.
There is a positive linear relationship between fixed acidity and density.
This is to be expected as the fixed acids found in wine are denser than water (1 gm/mL).
| Fixed Acid in Wine | Density |
|---|---|
| Tartaric Acid | 1.79 gm/mL |
| Malic Acid | 1.61 gm/mL |
| Citric Acid | 1.67 gm/mL |
| Succinic Acid | 1.56 gm/mL |
As expected, pH levels have an inverse linear relationship with fixed acidity. The lower the pH level, the more acidic the solution.
The higher the residual sugar and salt in a wine, the denser it seems to be.
Measuring wine density is a method for categorizing it as Dry, Medium-sweet or Sweet.
The plot shows an inverse linear relationship between alcohol and density: the higher the alcohol content, the lower the density.
The fermentation process converts sugars in grape juice to ethanol (ethyl alcohol). Density of ethanol is 0.789 gm/mL, which is lower than that of water (1 gm/mL).
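The direction of this effect can be illustrated with a rough calculation. This is a sketch only, assuming that the alcohol percentage is by volume and that ethanol and water mix ideally (no volume contraction, no dissolved sugar, acids or salts); `mix.density` is a hypothetical helper, not part of the original analysis.

```r
# Ideal-mixing estimate: volume-weighted average of the two densities (gm/mL)
mix.density <- function(alcohol.pct, ethanol = 0.789, water = 1.0) {
  fraction <- alcohol.pct / 100          # volume fraction of ethanol
  fraction * ethanol + (1 - fraction) * water
}

# Minimum, mean and maximum alcohol content in the data set
mix.density(c(8.4, 10.42, 14.9))
```

The estimates fall below the densities observed in the data set (0.990 to 1.004 gm/mL), which is expected because the sketch leaves out dissolved solids and non-ideal mixing; only the downward trend with rising alcohol carries over.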
These density charts indicate that there is a higher probability that wines with better grades have higher alcohol, sulphates and citric acid content but lower volatile acidity.
The three histograms below confirm that wines of lower quality tend to have lower alcohol and sulphates content but higher volatile acidity, and vice versa.
On the following histograms, wines of all qualities seem to be spread across the range of values of citric acid and density, indicating a weaker correlation between those properties and quality.
Plots, statistical measures and analysis thus far seem to indicate that there is a strong correlation between the three physicochemical properties alcohol, volatile acidity and sulphates, and wine quality.
The following plots provide further confirmation that these three properties can be significant in predicting wine quality.
This plot indicates that wines of High quality tend to have higher alcohol content and lower volatile acidity.
This scatter plot indicates that Low and Medium quality wines are concentrated at points where sulphates content is lower and volatile acidity is higher.
This chart shows that as alcohol and sulphates content increases, wine quality gets better.
| Property | Linear Relation | Comments |
|---|---|---|
| Fixed acidity | Weak, positive. | - |
| Volatile acidity | Medium to strong, negative. | Acetic acid at high levels can lead to an unpleasant, vinegar taste. |
| Citric acid | Weak to medium, positive. | Citric acid can add ‘freshness’ and flavor to wines. |
| Residual sugar | Very weak, positive. | All wines in the data set are fairly dry. The highest residual sugar level is 15.50 gm/L. Only wines over 45 gm/L are considered sweet. |
| Chlorides | Weak, negative. | - |
| Free sulfur dioxide | Very weak, negative. | - |
| Total sulfur dioxide | Weak, negative. | - |
| Density | Weak, negative. | Density of water is 0.99997 gm/mL. All wines in the data set are close to the density of water (0.99 to 1.004). |
| pH | Very weak, negative. | On a scale of 0 (very acidic) to 14 (very basic) most wines are between 3-4. All wines in the data set are between 2.74 and 4.01. |
| Sulphates | Medium, positive. | Sulphates contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. Interestingly, free and total sulfur dioxide levels do not seem to impact quality. |
| Alcohol | Strong, positive. | - |
Let’s build a linear model to predict wine quality using the following properties that have a medium to strong linear relation with quality as predictor variables: alcohol, volatile acidity and sulphates.
We need two distinct samples from red wine quality data. One sample will be used to train the linear model. The other sample will be used to test the model, and compare its results with the actual evaluation by wine experts.
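# 1,500 rows in the training set, the remaining 99 rows in the test set
# set.seed() must be called as a function to seed the random number generator
set.seed(1056)
sample.indices <- sample(1:nrow(wqr), 1500)
training <- wqr[sample.indices, ]
test <- wqr[-sample.indices, ]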
#Linear model
m1 <- lm(quality ~ alcohol, data = training)
m2 <- update(m1, ~ . + volatile.acidity)
m3 <- update(m2, ~ . + sulphates)
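A quick way to confirm that a split of this kind is reproducible and that the two samples do not overlap. This is a sketch on row indices only, assuming `set.seed()` is called as a function before each draw; `n.rows`, `n.train` and `test.rows` are illustrative names.

```r
n.rows  <- 1599    # rows in the data set
n.train <- 1500    # rows assigned to the training set

# Draw the training indices twice from the same seed
set.seed(1056)
first <- sample(1:n.rows, n.train)
set.seed(1056)
second <- sample(1:n.rows, n.train)

identical(first, second)                 # same seed gives the same sample

# Everything not drawn for training is left for the test set
test.rows <- setdiff(1:n.rows, first)
length(test.rows)
```

Seeding matters here because the model estimates and the test residuals both depend on which 1,500 wines end up in the training set.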
The linear model seems to be a good fit for the data based on the summary below:
The R^2 value of 0.334 indicates that about 33% of the variance in wine quality is explained by the three properties: alcohol, volatile acidity and sulphates.
The R^2 value of the model with only alcohol as a predictor variable (0.213) indicates that about 21% of the variance in quality is explained by alcohol alone.
Three significance stars (***) next to each coefficient indicate that it is unlikely that no relationship exists between that property and wine quality.
The p-value of 0.000 reported for each model belongs to the overall F-test; it indicates that a fit this good would be very unlikely if none of the predictors were relevant in predicting wine quality.
mtable(m1, m2, m3)
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = training)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = training)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
## data = training)
##
## ===============================================
## m1 m2 m3
## -----------------------------------------------
## (Intercept) 2.006*** 3.281*** 2.817***
## (0.181) (0.189) (0.201)
## alcohol 0.348*** 0.300*** 0.296***
## (0.017) (0.016) (0.016)
## volatile.acidity -1.465*** -1.303***
## (0.097) (0.100)
## sulphates 0.646***
## (0.104)
## -----------------------------------------------
## R-squared 0.213 0.316 0.334
## adj. R-squared 0.212 0.315 0.332
## sigma 0.712 0.663 0.655
## F 404.764 346.213 249.559
## p 0.000 0.000 0.000
## Log-likelihood -1616.849 -1511.100 -1491.906
## Deviance 758.348 658.617 641.977
## AIC 3239.698 3030.199 2993.813
## BIC 3255.638 3051.452 3020.379
## N 1500 1500 1500
## ===============================================
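To make the m3 column concrete, the fitted equation can be applied by hand. The sketch below uses the rounded coefficients from the table above, so its results may differ slightly from `predict(m3, ...)`; the helper `estimate.quality` is a hypothetical name.

```r
# Coefficients of m3 as printed in the table above
m3.coef <- c(intercept = 2.817, alcohol = 0.296,
             volatile.acidity = -1.303, sulphates = 0.646)

# quality = intercept + b1 * alcohol + b2 * volatile.acidity + b3 * sulphates
estimate.quality <- function(alcohol, volatile.acidity, sulphates, coef = m3.coef) {
  unname(coef["intercept"] +
    coef["alcohol"] * alcohol +
    coef["volatile.acidity"] * volatile.acidity +
    coef["sulphates"] * sulphates)
}

# First wine in the sample data: alcohol 9.4, volatile acidity 0.70, sulphates 0.56
estimate.quality(9.4, 0.70, 0.56)

# Effect of one extra percentage point of alcohol, other properties held fixed
estimate.quality(10.4, 0.70, 0.56) - estimate.quality(9.4, 0.70, 0.56)
```

For that first wine the estimate comes to roughly 5.05, against a grade of 5 from the experts.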
A test of the model on the 99 wines held out of training results in residuals that are fairly normally distributed, which is consistent with the assumptions of a linear model built on the three properties.
# Predict quality for the test set, with 95% prediction intervals
estimate <- predict(m3, newdata = test, interval = "prediction", level = 0.95)
estimate <- data.frame(estimate)
# Attach the experts' grades and compute the prediction error (predicted minus actual)
estimate$actual.quality <- test$quality
estimate$residual <- estimate$fit - estimate$actual.quality
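Residuals like these can be condensed into a couple of summary numbers. The following is a minimal sketch, not the report's own evaluation: it uses the six sample rows printed earlier and the rounded m3 coefficients so that it runs on its own, and those rows may well have been part of the training set. The names `wines`, `rmse` and `hit.rate` are illustrative.

```r
# First six rows of the data set, as printed earlier in this report
wines <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  actual.quality   = c(5, 5, 5, 6, 5, 5)
)

# Point estimates from the rounded m3 coefficients in the model table
wines$fit <- 2.817 + 0.296 * wines$alcohol -
  1.303 * wines$volatile.acidity + 0.646 * wines$sulphates

# Same sign convention as the report: predicted minus actual
wines$residual <- wines$fit - wines$actual.quality

# Root mean squared error, and share of estimates that round to the expert grade
rmse <- sqrt(mean(wines$residual^2))
hit.rate <- mean(round(wines$fit) == wines$actual.quality)
c(rmse = rmse, hit.rate = hit.rate)
```

Applied to the `estimate` data frame built above, the same two lines (`sqrt(mean(estimate$residual^2))` and the rounded comparison) would summarise performance on the 99 test wines.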
I started out with an analysis of each individual data element to get an idea of the nature and distribution of its values. Univariate analysis indicated that most wines are of Medium quality, and none of the wines have extremely low or high ratings. Certain properties, such as citric acid and free sulfur dioxide, that can positively impact quality were not found to be prevalent at high rates or levels in the wines.
The analysis then progressed to test the impact of each physicochemical property on quality. Alcohol, volatile acidity, sulphates, citric acid and density were found to have a linear relationship with quality.
Relationships between each individual property were also analysed, revealing correlation between density and alcohol, density and residual sugars, fixed acidity and citric acid, besides others. Many of those relationships could be explained by their physical and chemical attributes.
Multivariate analysis on alcohol, volatile acidity, sulphates, citric acid and density revealed that citric acid and density did not have as strong a linear relationship with wine quality as the other three properties. The correlation coefficients confirmed that finding.
A linear model was built using alcohol, volatile acidity and sulphates as predictor variables. The table of estimates and a plot of the residuals indicated that the model was a good fit.
However, those three properties explain only about 33% of the variance in wine quality. It seems natural that more than just 3 of 11 physicochemical properties of wine should determine quality. A larger data set with a greater range of values for certain properties such as citric acid and free sulfur dioxide may allow us to use more predictor variables and build a better model.